The dataset we are looking at was presented by Cortez et al. (see the reference below). It contains a large collection (about 5000) of white wines, each with a quality score assigned by experts together with various physical and chemical properties, such as density, pH, and alcohol.
The goal of this project is to analyze and understand this dataset. In particular, we would like to find answers to the following questions:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Below is a simple summary of the data.
There are 4898 white wine observations in the dataset with 13 variables in total, including an index variable (named “X”), the “quality” variable, and 11 other variables describing the chemical properties of the wine.
The quality of the wine is an integer variable with a minimum of 3, a maximum of 9, a median of 6, and a mean of 5.878.
All the chemical property variables are floating-point numbers. They are measured in different units and therefore lie in widely different ranges. For example, the chlorides variable has a small range from 0.009 to 0.346, while the total.sulfur.dioxide variable has a large range from 9.0 to 440.0.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
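A summary like the one above can be produced with `str()` and `summary()`. The sketch below is only an illustration: the file name in the comment is an assumption, and a small made-up data frame stands in for the real data so that the snippet runs on its own.

```r
# Hypothetical loading step; the actual file name is an assumption:
# df <- read.csv("wineQualityWhites.csv")

# Toy stand-in with a few of the same column names (values are made up).
set.seed(1)
n <- 20
df <- data.frame(
  X       = seq_len(n),
  alcohol = round(runif(n, 8, 14), 1),
  pH      = round(runif(n, 2.7, 3.8), 2),
  quality = sample(3:9, n, replace = TRUE)
)

str(df)      # column types and the first few values of each column
summary(df)  # min, quartiles, mean and max of each column
```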
The main features in the dataset are alcohol and quality. I suspect that alcohol, combined with some of the other variables, can be used to build a predictive model for wine quality.
Below we plot the histogram of the quality variable. The variable is discrete, but its histogram has a roughly bell-shaped (normal-like) profile centered on a score of 6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
Similarly, we plot the histogram of the alcohol variable. The distribution is plotted with different binwidths so that we can look at the data at different “resolutions”. At the coarse level (binwidth=1), we see a skewed distribution with the largest number of samples in [9, 10], followed by [10, 11], then [11, 12], and so on. At the fine level (binwidth=0.1), we see more irregularities in the distribution, with multiple spikes, for example at [9.0, 9.1], [9.5, 9.6], and [10.0, 10.1].
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
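The effect of the binwidth can be sketched with base R's `hist()`, where the `breaks` argument plays the role of the binwidth. The alcohol values below are made up (uniform noise in the observed range), so the shapes will not match the real plots; only the mechanics are illustrated.

```r
# Made-up alcohol values in the 8 to 14.2 range (not the real data).
set.seed(2)
alcohol <- round(runif(500, min = 8, max = 14.2), 1)

# Coarse view: bins of width 1.
coarse <- hist(alcohol, breaks = seq(8, 15, by = 1),
               main = "Alcohol, binwidth = 1", xlab = "alcohol")

# Fine view: bins of width 0.1.
fine <- hist(alcohol, breaks = seq(8, 15, by = 0.1),
             main = "Alcohol, binwidth = 0.1", xlab = "alcohol")

# Number of bins at each resolution.
length(coarse$counts)
length(fine$counts)
```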
Features such as residual.sugar, sulphates, pH, chlorides will likely contribute to the wine quality and will support our investigation.
I created an ordered factor version of quality from its original integer version. I also grouped wine quality into three buckets, (3, 4, 5), (6), and (7, 8, 9), so that each bucket has more samples for analysis.
df$quality.ordered <- as.ordered(df$quality)
df$quality.bucketed <- cut(df$quality, c(2, 5, 6, 10))
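To check that the break points used with `cut()` give the intended three buckets, the same breaks can be applied to each possible score once. This is a small sanity check on the integers 3 to 9, not on the dataset itself.

```r
# Every possible quality score, once.
quality <- 3:9

# Same breaks as above: intervals are open on the left, closed on the right.
bucketed <- cut(quality, c(2, 5, 6, 10))

levels(bucketed)  # "(2,5]"  "(5,6]"  "(6,10]"
table(bucketed)   # 3 scores in the low bucket, 1 in the middle, 3 in the high

# The ordered factor keeps the natural ordering of the scores.
ordered_quality <- as.ordered(quality)
is.ordered(ordered_quality)
```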
During the investigation, I found that the chlorides variable has an unusual distribution. From the histogram shown below, we see that the majority of samples lie in the range [0, 0.1] in a roughly normal shape, but a small number of outliers lie far beyond this range (up to 0.346), which indicates a long-tailed distribution.
To better visualize this distribution, we tried two approaches:
1. Cut off the samples beyond 0.1 and “zoom in” on those in the “regular range”.
2. Plot the distribution on a log10 scale.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
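The two approaches to a long-tailed variable can be sketched in base R as follows. The values are made up (drawn from a log-normal distribution as a stand-in for chlorides), and the second plot transforms the values with `log10()` rather than rescaling the axis, which is a simplification I chose for this sketch.

```r
set.seed(3)
# Made-up long-tailed values standing in for chlorides (not the real data).
chlorides <- rlnorm(1000, meanlog = log(0.045), sdlog = 0.4)

# Approach 1: keep only the "regular range" and zoom in.
regular <- chlorides[chlorides <= 0.1]
hist(regular, breaks = 40, main = "chlorides <= 0.1", xlab = "chlorides")

# Approach 2: histogram of the log10-transformed values.
hist(log10(chlorides), breaks = 40,
     main = "log10(chlorides)", xlab = "log10(chlorides)")
```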
The normal range of fixed.acidity is 5.0 to 10.0. There are a small number of outliers that have values larger than this range.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The normal range of residual.sugar is 0.0 to 20.0. Again, there are a few outliers with values much larger than this range (up to 65.8).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The normal range of density is 0.99 to 1.00. Most of the samples are within this range, with a few out of the range but not significantly larger (up to 1.039).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The normal range of sulphates is 0.2 to 1.0. Almost all samples are within this “normal range”, with a few exceptions just outside of this range but not far away.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Below we make a box plot of alcohol level for each quality score.
We can see a clear dependency between alcohol and quality: the alcohol level tends to be high for both low-quality and high-quality wines, but low for medium-quality wines. I find this observation very interesting.
Also, the highest-quality wines (9) have a quite concentrated alcohol level; in other words, the variance of alcohol level at this quality is low. Later I realized that there are very few samples (5 in total) with a quality score of 9, so the small variance could partly be attributed to the lack of data.
## Correlation: 0.4355747
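A box plot per quality score and the accompanying correlation could be produced along these lines. This is a minimal base-R sketch on made-up data, so the printed number will not match the value above; `cor()` computes the Pearson correlation by default.

```r
set.seed(4)
n <- 300
quality <- sample(3:9, n, replace = TRUE)
# Made-up alcohol values that loosely increase with quality.
alcohol <- 9 + 0.3 * quality + rnorm(n)
df <- data.frame(alcohol, quality)

# One box per quality score.
boxplot(alcohol ~ quality, data = df, xlab = "quality", ylab = "alcohol")

# Pearson correlation between the two variables.
r <- cor(df$alcohol, df$quality)
cat("Correlation:", r, "\n")

# Sample count per quality score; small groups give unreliable boxes.
table(df$quality)
```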
We found a weak inverse correlation between chlorides and wine quality. From the figure below, we see that apart from the lowest and highest qualities, where we have relatively few data points, wines tend to have a higher quality when their chlorides level is lower.
## Correlation: -0.2099344
We compute the correlation of quality with each individual feature in the dataset and print the resulting table below. Alcohol has the strongest positive correlation with quality (0.436), and density has the strongest negative correlation (-0.307). I did not expect the latter before analyzing the dataset.
## fixed.acidity volatile.acidity citric.acid
## -0.113662831 -0.194722969 -0.009209091
## residual.sugar chlorides free.sulfur.dioxide
## -0.097576829 -0.209934411 0.008158067
## total.sulfur.dioxide density pH
## -0.174737218 -0.307123313 0.099427246
## sulphates alcohol
## 0.053677877 0.435574715
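A table like the one above can be computed by applying `cor()` to every feature column. The sketch below uses a made-up data frame with only three features, so the numbers are illustrative; the column selection (everything except `X` and `quality`) mirrors the description in the text.

```r
set.seed(5)
n <- 200
# Made-up data with a few of the same column names.
df <- data.frame(
  X       = seq_len(n),
  alcohol = rnorm(n, 10.5, 1.2),
  density = rnorm(n, 0.994, 0.003),
  pH      = rnorm(n, 3.2, 0.15)
)
df$quality <- round(6 + 0.5 * (df$alcohol - 10.5) + rnorm(n, sd = 0.8))

# Every column except the index and the target is treated as a feature.
features <- setdiff(names(df), c("X", "quality"))
correlations <- sapply(df[features], function(col) cor(col, df$quality))
correlations

# Features with the largest and the smallest (most negative) correlation.
names(which.max(correlations))
names(which.min(correlations))
```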
There is a negative correlation between density and quality. This is partly visible in the box plot; for example, the highest-quality samples have the lowest density quantiles. We also see that the two density outliers (values larger than 1.005) both have the medium quality score of 6.
## Correlation: -0.3071233
There is a very small positive correlation between sulphates and quality, which is also supported by the box plot: the median and quantiles of sulphates do not change significantly among the different wine qualities.
## $title
## [1] "Quality v.s. sulphates"
##
## attr(,"class")
## [1] "labels"
## Correlation: 0.05367788
There is a small negative correlation between fixed acidity and quality, which is supported by the box plot. As the quality of the wine increases (from left to right), the median and quantiles of fixed acidity decrease slightly, with the exception of the highest-quality wines, which in fact have a relatively large fixed acidity compared to the others.
## Correlation: -0.1136628
We can see a strong negative correlation between density and alcohol, mostly because alcohol is less dense than water (which makes up the majority of the wine). The scatter plot confirms this observation, and also shows two outliers with large density but an unremarkable alcohol level.
## Correlation: -0.7801376
I expected some correlation between volatile acidity and fixed acidity, because they are somehow chemically related (according to my very limited chemistry knowledge). The visualization below shows that the correlation is in fact very low, which means the two properties are not linearly related. For example, a wine sample can just as well have high fixed acidity with relatively low volatile acidity as high fixed acidity with high volatile acidity.
## Correlation: -0.02269729
I plotted the “non-free” sulfur dioxide (computed as the difference of total sulfur dioxide and free sulfur dioxide) versus the free sulfur dioxide. The results suggest that there is a weak correlation: for samples with high level of free sulfur dioxide, their level of “non-free” sulfur dioxide is usually also high (although not always).
## Correlation: 0.2635373
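The derived “non-free” variable is simply the difference of two columns. In the sketch below the column name `nonfree.sulfur.dioxide` is my own label for illustration, and the data are made up, so the correlation will differ from the one reported above.

```r
set.seed(6)
n <- 200
# Made-up free and total sulfur dioxide levels (total >= free by construction).
free  <- runif(n, 2, 80)
bound <- pmax(60 + 0.5 * free + rnorm(n, sd = 30), 0)
df <- data.frame(free.sulfur.dioxide  = free,
                 total.sulfur.dioxide = free + bound)

# "Non-free" sulfur dioxide: total minus free (hypothetical column name).
df$nonfree.sulfur.dioxide <- df$total.sulfur.dioxide - df$free.sulfur.dioxide

plot(df$free.sulfur.dioxide, df$nonfree.sulfur.dioxide,
     xlab = "free sulfur dioxide", ylab = "non-free sulfur dioxide")
r <- cor(df$free.sulfur.dioxide, df$nonfree.sulfur.dioxide)
cat("Correlation:", r, "\n")
```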
As with volatile acidity versus fixed acidity, the correlation between sulphates and total sulfur dioxide is very small. The plot confirms this: the total sulfur dioxide level of a sample has very little predictive power for the sample’s sulphates level.
## Correlation: 0.05921725
In the figure below we plot chlorides against sulphates, grouped and colored by wine quality. From this plot we see that, conditioned on the wine quality group, chlorides is mostly independent of (constant with respect to) sulphates. We also see that low-quality wines tend to have higher chlorides levels while high-quality wines tend to have lower chlorides levels, even though sulphates spans roughly the same range for each quality group.
We also added a scatter plot of all data points. The variation of chlorides given sulphates is quite large, but the general trend is visible: low-quality wines (red points) tend to have larger chlorides than high-quality wines (blue points).
One interesting and somewhat surprising fact I found is that the relationships between most pairs of features do not depend on wine quality. Take the following plot for example: when plotting density against alcohol, grouped by quality, we see that the groups mostly follow the same decreasing relationship, and the curves are very close to or indistinguishable from each other. After thinking about it more, I believe it is reasonable that the relationship of one physical/chemical property to another is mostly consistent and independent of quality, because it is usually governed by the laws of physics and chemistry rather than human taste. For example, the more alcohol a wine contains, the lighter (less dense) it will be, because alcohol is less dense than water (which makes up most of the wine), and this fact holds regardless of how the wine tastes.
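One way to check whether a relationship depends on quality is to fit the same linear model within each quality bucket and compare the slopes. The sketch below shows only the mechanics: the data are made up and density is generated independently of quality by construction, so the near-identical slopes are not evidence about the real dataset.

```r
set.seed(11)
n <- 600
alcohol <- runif(n, 8, 14)
quality <- sample(3:9, n, replace = TRUE)

# Made-up data: density decreases with alcohol, independently of quality.
df <- data.frame(
  alcohol = alcohol,
  density = 1.005 - 0.001 * alcohol + rnorm(n, sd = 0.001),
  bucket  = cut(quality, c(2, 5, 6, 10))
)

# Slope of density ~ alcohol within each quality bucket.
slopes <- sapply(split(df, df$bucket), function(g) {
  coef(lm(density ~ alcohol, data = g))[["alcohol"]]
})
slopes
```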
In this plot we draw the histogram and density of alcohol level. The binwidth of the histogram is set to 0.2, and the density is estimated with a Gaussian kernel with default adjust=1.
From the visualization we see that the distribution of alcohol level in the sample set is asymmetric (not a normal distribution). More specifically, it is right-skewed: the mass is concentrated at the lower end, so there are more wines with lower alcohol levels (9 to 10) than with higher alcohol levels (11 to 12).
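A histogram with binwidth 0.2 overlaid with a Gaussian kernel density estimate can be sketched in base R as below. The values are made up (a shifted gamma distribution chosen only because it is right-skewed), and comparing the mean with the median is a quick, informal check of the skew direction.

```r
set.seed(7)
# Made-up right-skewed values standing in for alcohol (not the real data).
alcohol <- 8 + rgamma(1000, shape = 3, rate = 1.2)

# Histogram on the density scale with bins of width 0.2.
breaks <- seq(floor(min(alcohol)), ceiling(max(alcohol)) + 0.2, by = 0.2)
hist(alcohol, breaks = breaks, freq = FALSE,
     main = "Alcohol level", xlab = "alcohol")

# Gaussian kernel density estimate with the default bandwidth adjustment.
dens <- density(alcohol, kernel = "gaussian", adjust = 1)
lines(dens, col = "red", lwd = 2)

# For a right-skewed sample the mean typically exceeds the median.
mean(alcohol) > median(alcohol)
```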
## Correlation: 0.4355747
In this plot we draw the quality of the wines versus their alcohol level. We use a scatter plot with alpha=0.5 plus some jittering to visualize the actual distribution of alcohol at each quality level. In addition, we plot the 10% and 90% quantiles (blue bars) together with the median (red cross) to better show the general trend of the data.
From the exploration above, we found that alcohol is the feature with the largest correlation (0.436) with wine quality among all the given features. We can see that for wine samples of quality 5 or higher, the median alcohol level grows as the quality gets better (the red cross drifts rightwards). However, we also see that low-quality wines (3 and 4) tend to have a higher alcohol level.
I find this observation very interesting and also reasonable: people usually like the taste of “good alcohol” in wine, the kind that comes from fruit fermenting for a long enough time; but there are also manufacturers that try to artificially boost the alcohol level of their wine, in which case the tasting experts (and a lot of ordinary people) will be able to tell.
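The per-quality quantiles and the jittered scatter plot described above can be sketched in base R as follows. The data are made up, so the trend shown is an artifact of how they were generated; the point is how the 10%, 50%, and 90% quantiles are computed per quality score and overlaid.

```r
set.seed(8)
n <- 400
quality <- sample(3:9, n, replace = TRUE)
alcohol <- 9 + 0.3 * quality + rnorm(n)  # made-up values

# One row per quality score, columns "10%", "50%", "90%".
stats <- t(sapply(split(alcohol, quality), quantile, probs = c(0.1, 0.5, 0.9)))
stats

q_levels <- as.numeric(rownames(stats))

# Semi-transparent, vertically jittered points; alcohol on the x axis.
plot(alcohol, jitter(quality), pch = 16, col = rgb(0, 0, 0, alpha = 0.5),
     xlab = "alcohol", ylab = "quality (jittered)")
# Blue bars span the 10% to 90% quantiles; red crosses mark the medians.
segments(stats[, "10%"], q_levels, stats[, "90%"], q_levels,
         col = "blue", lwd = 2)
points(stats[, "50%"], q_levels, pch = 4, col = "red", lwd = 2)
```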
In this plot we draw a scatter plot of alcohol versus residual sugar, colored by the wine quality and super-imposed with the median curve.
From this plot we can see how combining two features can give a better prediction of wine quality. For example, for residual sugar below 10, there is a clear trend: the higher the alcohol level, the better the wine quality tends to be. Beyond that residual sugar level, all wines tend to have low alcohol, and its discriminating power diminishes. This effect is visible not only in the median statistics but also in the scatter plot: in the left half of the plot (low residual sugar), blue points (high-quality wines) tend to sit higher (larger alcohol level) than the green and red points.
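A median curve per quality bucket can be approximated by binning residual sugar and taking the median alcohol in each bin. The bin width of 5 and the made-up data below are my own choices for this sketch, not the settings behind the plot discussed above.

```r
set.seed(9)
n <- 600
residual.sugar <- runif(n, 0.6, 20)
quality <- sample(3:9, n, replace = TRUE)
# Made-up alcohol values: lower with more sugar, higher with better quality.
alcohol <- 12 - 0.12 * residual.sugar + 0.2 * (quality - 6) + rnorm(n, sd = 0.8)

bucket    <- cut(quality, c(2, 5, 6, 10))
sugar_bin <- cut(residual.sugar, breaks = seq(0, 20, by = 5))

# Median alcohol for every (sugar bin, quality bucket) combination.
medians <- tapply(alcohol, list(sugar_bin, bucket), median)
medians

# One median curve per bucket, drawn at the bin midpoints.
matplot(c(2.5, 7.5, 12.5, 17.5), medians, type = "b", pch = 1, lty = 1,
        col = c("red", "green", "blue"),
        xlab = "residual sugar (bin midpoint)", ylab = "median alcohol")
legend("topright", legend = colnames(medians),
       col = c("red", "green", "blue"), lty = 1)
```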
I have several take-home messages from this project:
Understanding the range of the data is very important. It is usually very helpful to first plot the histogram of each variable in order to get a sense of how it is distributed and to decide on a reasonable axis scale for presenting it. Without such a step, the resulting visualization can be very skewed and hard to interpret.
Having a reasonably sized dataset is important. When there are too few data points, the statistical analysis might be less reliable. For example, there are only 5 samples of quality 9 wine, and a box plot or quantile computed from these 5 samples might not be as robust as one computed from, say, 500 samples.
Unexpected results are not necessarily wrong; they might just reflect a fact that we overlooked before. For example, I expected that, conditioned on wine quality, the curves of one physical/chemical property against another would be distinguishable from one another. This turned out not to be true: as the analysis showed, those relationships are often governed by physical and chemical laws and are therefore not very dependent on human taste.
Finding meaningful statistics and making good visualizations usually takes great effort to explore different features and experiment with different visual cues. Sometimes the relationship I expected is wrong (as in the previous point); sometimes the intuition is correct but I just do not have the right visualization (for example, the axis limits or scale are wrong, or the binwidth is inappropriate). I realized that the more trial and error I did, the more I could usually make sense of the data and find interesting trends in it.
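The point about sample size can be illustrated with a small simulation: draw many samples of size 5 and of size 500 from the same distribution and compare how much the sample median varies. The distribution parameters are arbitrary values chosen for this sketch, and `median_spread` is a helper I made up for it.

```r
set.seed(10)

# Standard deviation of the sample median over repeated samples of size n.
median_spread <- function(n, reps = 2000) {
  sd(replicate(reps, median(rnorm(n, mean = 10.5, sd = 1.2))))
}

spread_5   <- median_spread(5)
spread_500 <- median_spread(500)

# The median of 5 samples fluctuates far more than the median of 500.
c(n5 = spread_5, n500 = spread_500)
```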